Toward General-Purpose Learning for Information Extraction

Author

  • Dayne Freitag
Abstract

Two trends are evident in the recent evolution of the field of information extraction: a preference for simple, often corpus-driven techniques over linguistically sophisticated ones, and a broadening of the central problem definition to include many non-traditional text domains. This development calls for information extraction systems that are as retargetable and general as possible. Here, we describe SRV, a learning architecture for information extraction that is designed for maximum generality and flexibility. SRV can exploit domain-specific information, including linguistic syntax and lexical information, in the form of features provided to the system explicitly as input for training. This process is illustrated using a domain created from Reuters corporate acquisitions articles. Features are derived from two general-purpose NLP systems, Sleator and Temperley's link grammar parser and WordNet. Experiments compare the learner's performance with and without such linguistic information. Surprisingly, in many cases, the system performs as well without this information as with it.

1 Introduction

The field of information extraction (IE) is concerned with using natural language processing (NLP) to extract essential details from text documents automatically. While the problems of retrieval, routing, and filtering have received considerable attention through the years, IE is only now coming into its own as an information management sub-discipline. Progress in the field of IE has been away from general NLP systems that must be tuned to work in a particular domain, toward faster systems that perform less linguistic processing of documents and can be more readily targeted at novel domains (e.g., Appelt et al., 1993). A natural part of this development has been the introduction of machine learning techniques to facilitate the domain engineering effort (Riloff, 1996; Soderland and Lehnert, 1994).

Several researchers have reported IE systems that use machine learning at their core (Soderland, 1996; Califf and Mooney, 1997). Rather than spend human effort tuning a system for an IE domain, it becomes possible to conceive of training it on a document sample. Aside from the obvious savings in human development effort, this has significant implications for information extraction as a discipline:

  • Retargetability: Moving to a novel domain should no longer be a question of code modification; at most some feature engineering should be required.
  • Generality: It should be possible to handle a much wider range of domains than previously. In addition to domains characterized by grammatical prose, we should be able to perform information extraction in domains involving less traditional structure, such as netnews articles and Web pages.

In this paper we describe a learning algorithm similar in spirit to FOIL (Quinlan, 1990), which takes as input a set of tagged documents and a set of features that control generalization, and produces rules that describe how to extract information from novel documents. For this system, introducing linguistic or any other information particular to a domain is an exercise in feature definition, separate from the central algorithm, which is constant. We describe a set of experiments, involving a document collection of newswire articles, in which this learner is compared with simpler learning algorithms.
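
To make the separation between feature definition and the central algorithm more concrete, the following is a minimal sketch of a FOIL-style selection step. It is not SRV itself: the feature names (`capitalized`, `numeric`, `long_word`), the function names, and the toy labeled tokens are all assumptions made for illustration. The gain formula is the standard FOIL information gain in its propositional form, where the weight is the number of positive examples still covered after the test is added.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical token features. In this sketch, domain knowledge enters only
# through this table; the selection code below never changes.
FEATURES: Dict[str, Callable[[str], bool]] = {
    "capitalized": lambda tok: tok[:1].isupper(),
    "numeric": lambda tok: tok.isdigit(),
    "long_word": lambda tok: len(tok) > 5,
}


def foil_gain(p0: int, n0: int, p1: int, n1: int) -> float:
    """FOIL information gain for adding one test to a rule.

    p0, n0: positive/negative examples covered before adding the test.
    p1, n1: positive/negative examples covered after adding the test.
    """
    if p0 == 0 or p1 == 0:
        return 0.0  # nothing useful is covered, so the test earns no gain
    info_before = -math.log2(p0 / (p0 + n0))
    info_after = -math.log2(p1 / (p1 + n1))
    return p1 * (info_before - info_after)


def best_feature(examples: List[Tuple[str, bool]]) -> Tuple[str, float]:
    """Return the feature test with the highest FOIL gain on labeled tokens."""
    p0 = sum(1 for _, label in examples if label)
    n0 = len(examples) - p0
    scored = []
    for name, test in FEATURES.items():
        covered = [(tok, label) for tok, label in examples if test(tok)]
        p1 = sum(1 for _, label in covered if label)
        n1 = len(covered) - p1
        scored.append((name, foil_gain(p0, n0, p1, n1)))
    return max(scored, key=lambda pair: pair[1])


# Toy data (invented): True marks tokens tagged as part of a company name.
EXAMPLES = [
    ("Acme", True), ("Globex", True), ("Initech", True),
    ("acquired", False), ("shares", False), ("1998", False), ("The", False),
]

if __name__ == "__main__":
    print(best_feature(EXAMPLES))
```

In this sketch, adding syntactic or lexical information of the kind described above would amount to adding entries to `FEATURES`; `foil_gain` and `best_feature` stay constant.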

Similar resources

The Relationship between Information Literacy and Access to Facilities with Attitudes toward E-learning among students of Urmia University of Medical Sciences

Introduction: E-learning is considered one of the most important elements of higher education in the information era. The present study aimed to investigate the relationship between information literacy and access to facilities with attitudes toward e-learning among students of Urmia University of Medical Sciences. Methods: This descriptive study was performed on 190 senior students of Urmi...

The Interplay between Ethnic Identities and Social Attitude toward Foreign Language Learning and Language Proficiency of Young Gilak EFL Learners

As a social-psychological phenomenon, language learning involves several factors. Two significant factors that have recently attracted scholars' attention are ethnicity and social attitude toward L2. Taking this into account, the present study sought to investigate the relationship between Gilak ethnic identity, social attitude toward foreign language, and L2 proficiency...

Active Learning Selection Strategies for Information Extraction

The need for labeled documents is a key bottleneck in adaptive information extraction. One way to solve this problem is through active learning algorithms that require users to label only the most informative documents. We investigate several document selection strategies that are particularly relevant to information extraction. We show that some strategies are biased toward recall, while other...

Analysis of Recorded Rainfall Information for the Purpose of Huff Curves Extraction in the Dez Dam

Identifying the rainfall characteristics and understanding the rainfall-related processes is one of the key factors in the scientific management of water resources. Selection of the design storm is the first step in the estimation of the design flood. Determining temporal rainfall patterns is very important as one of the design rainfall properties in flood estimation and the design of drainage ...

Towards Unsupervised Learning of Temporal Relations between Events

Automatic extraction of temporal relations between event pairs is an important task for several natural language processing applications such as Question Answering, Information Extraction, and Summarization. Since most existing methods are supervised and require large corpora, which for many languages do not exist, we have concentrated our efforts to reduce the need for annotated data as much a...

Publication date: 1998